[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446
[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446jucor wants to merge 7 commits into
Conversation
0f32d4c to
abfeacc
Compare
e2f39bb to
89297cf
Compare
There was a problem hiding this comment.
Pull request overview
Updates Delphi’s repness significance thresholds to use one-tailed z-score cutoffs (aligning with the Clojure implementation and existing stats.py expectations), and refreshes affected tests and golden snapshots.
Changes:
- Update
Z_90/Z_95constants inrepness.pyto one-tailed thresholds (1.2816 / 1.6449). - Un-xfail and adjust unit/discrepancy tests to assert the new thresholds (including boundary assertions).
- Re-record regression golden snapshot data reflecting the new repness behavior.
Reviewed changes
Copilot reviewed 19 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
delphi/polismath/pca_kmeans_rep/repness.py |
Switch Z-score thresholds to one-tailed constants and update inline documentation. |
delphi/tests/test_repness_unit.py |
Update z-score significance unit tests for the new thresholds. |
delphi/tests/test_old_format_repness.py |
Update backwards-compat repness tests for the new thresholds. |
delphi/tests/test_discrepancy_fixes.py |
Remove D9 xfail markers now that thresholds match expected values. |
delphi/tests/simplified_repness_test.py |
Update the script constant to the new Z_90 value. |
delphi/real_data/r6vbnhffkxbd7ifmfbdrd-vw/golden_snapshot.json |
Update golden snapshot outputs after repness threshold change. |
delphi/docs/PLAN_DISCREPANCY_FIXES.md |
Add task parallelization notes related to discrepancy fix sequencing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
89297cf to
403be8d
Compare
9149f7f to
e3c5893
Compare
403be8d to
464da16
Compare
db36889 to
69350d5
Compare
69350d5 to
382de2f
Compare
f8a7007 to
19cce44
Compare
6b81d8e to
6da6172
Compare
19cce44 to
bf2dd99
Compare
6da6172 to
0ef54ca
Compare
bf2dd99 to
f8c5793
Compare
0ef54ca to
b62c0cd
Compare
f8c5793 to
7f733c1
Compare
b62c0cd to
83bfe23
Compare
7f733c1 to
c920b61
Compare
09747ea to
8d94246
Compare
4ea1333 to
917cba8
Compare
917cba8 to
e45a120
Compare
8d94246 to
9397ddf
Compare
9397ddf to
e96a1f7
Compare
Delphi Coverage Report
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Returns: | ||
| True if significant at 90% confidence | ||
| """ | ||
| return abs(z) >= Z_90 | ||
| return z > Z_90 | ||
|
|
There was a problem hiding this comment.
The significance helpers were switched to one-tailed semantics (z > Z_90), but Z_90 is still set to 1.645 (two-tailed 90%). To match Clojure’s z-sig-90? (> 1.2816), update the constant (and its comment) to 1.2816; otherwise the new one-tailed check remains overly conservative and D9 parity tests can’t pass once enabled.
| Returns: | ||
| True if significant at 95% confidence | ||
| """ | ||
| return abs(z) >= Z_95 | ||
| return z > Z_95 | ||
|
|
There was a problem hiding this comment.
Same issue as Z_90: z_score_sig_95 now uses one-tailed strict z > Z_95, but Z_95 is still 1.96 (two-tailed 95%). Clojure’s z-sig-95? uses 1.6449; update Z_95 accordingly so 95% gating matches the reference implementation.
| @@ -562,6 +564,20 @@ def test_z95_matches_clojure(self): | |||
| check.almost_equal(Z_95, 1.6449, abs=0.001, | |||
| msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}") | |||
There was a problem hiding this comment.
The D9 threshold value assertions are still marked @pytest.mark.xfail, which means this PR’s stated purpose (changing Z_90/Z_95 values) isn’t actually enforced by CI. Once the constants are updated, these xfails should be removed so regressions in the threshold values will fail the suite.
| z_score = two_prop_test(70, 100, 50, 100) # 70/100 vs 50/100 | ||
| print(f"Comparison Z-score: {z_score}") | ||
| ``` | ||
| Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are |
There was a problem hiding this comment.
This doc references z_sig_90/z_sig_95, but repness.py defines z_score_sig_90/z_score_sig_95 (and stats.py has been removed). Update the names here (or add documented aliases) so the docs reflect the actual public API.
| Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are | |
| Statistical helpers (`z_score_sig_90`, `z_score_sig_95`, `prop_test`, `two_prop_test`) are |
b68bd5b to
583f955
Compare
e96a1f7 to
574c169
Compare
Map dependency graph and file boundaries for D5-D12. Two parallel tracks possible: repness formulas (D5→D6→D7→D8→D10→D11) and conversation/PCA (D3→D15→D12), with D1/D1b after both. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ead stats.py repness.py: change z_score_sig_90/95 from abs(z) >= threshold to z > threshold, matching Clojure's (> z-val 1.2816). Also fix inline significance filters in select_rep_comments_df to use > without .abs(). Remove stats.py and test_stats.py — unused dead code from the original AI-generated Clojure port. repness.py defines its own z-sig functions and never imported stats.py. Update usage_examples.md to point to repness.py instead of deleted stats.py. Add D9 blob comparison tests (significance sets and z-values, xfail until D5/D6), D5/D6/D7/D8 blob comparison tests with shared_count guards, tid type fixes, and D9 unit tests for strict > and one-tailed semantics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Require side-by-side Clojure/Python verification for every formula change - Exhaustive RED phase: boundary conditions, edge cases, missing test audit - Double-check array shapes, indices, aggregation axes after implementation - Allow skipping very large private datasets during iteration, only run as final validation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR 14 (vectorized code refactor) is now a prerequisite for all formula fix PRs, not a post-parity cleanup. It branches off jc/clj-parity-d9-fix (Stack 13) to make the vectorized production path readable and testable against Clojure blob values. Remaining dead code cleanup split to PR 14b. Handoff doc at delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The D9 z-score threshold change (two-tailed → one-tailed) combined with upstream fixes cascaded into this branch changed the repness output enough to invalidate the vw golden snapshot. Biodiversity re-recorded too for consistency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
574c169 to
b64cae8
Compare
583f955 to
380b00d
Compare
|
Superseded by spr-managed PR stack. See the new stack starting at #2508. |
* Vectorize participant info computation (3-15x speedup) ## Summary - Replaces the O(N×G×C) per-participant Python loop in `_compute_participant_info_optimized` with bulk NumPy operations: matrix-wide vote counting (`np.sum` over axis) and per-group Pearson correlation via `P @ g` matrix multiply - Adds 31 unit tests covering vote counts, group correlations, edge cases (small groups, zero-std, NaN handling, missing members), and golden snapshot regression - Correlations now return Python `float` instead of `numpy.float64` - Includes a benchmark script (`scripts/benchmark_participant_info.py`) that runs old vs new on the same data ## Benchmark results Measured on real datasets (5 runs, median), old per-participant loop vs new vectorized: | Dataset | Size | Old | New | Speedup | |---------|------|-----|-----|---------| | vw | 69p × 125c × 4g | 0.011s | 0.001s | **14.6x** | | biodiversity | 536p × 314c × 2g | 0.047s | 0.006s | **8.1x** | | _(larger private datasets)_ | | | | **3–6x** | Speedup is higher on smaller datasets (loop overhead dominates) and lower on very large ones (matrix materialization dominates). Overall **3–15x** depending on size. ## Test plan - [x] 31 unit tests pass (pre-vectorization baseline established first, then re-run post) - [x] Golden snapshot regression passes for biodiversity + vw - [x] Full regression test suite passes (40/40) - [x] Benchmark run on all datasets including private (results above) - [x] Max correlation diff across all datasets: < 2e-15 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Address Copilot review: clarify MAP estimate, centralize PSEUDO_COUNT imports - Update plan: add stack cross-reference and GitHub PR numbers - Vectorize _compute_participant_info_optimized for ~100x speedup - Add benchmark script for participant info vectorization - Address Copilot review: drop unused cache, validate --runs, tighten type assert - Remove old participant_stats() in favor of vectorized replacement commit-id:ea747196 * Fix test DB connection: use DATABASE_URL with dotenv ## Summary - **Absorbed PR #2434** (DB connect_timeout): adds `connect_timeout=5` to all DB connections so tests fail fast instead of hanging when Postgres is unavailable. Reverts timeout from production code (a 5s timeout could break real deployments), keeping it only in test code. - Adds `require_dynamodb()` and `require_s3()` helpers to conftest that probe services with short timeouts and `pytest.fail()` immediately — applied to all tests that previously hung when Docker services were down. - Adds `pytest-timestamper` for per-test timing visibility. - Fixes `test_postgres_real_data.py` to use `DATABASE_URL` (consistent with all other delphi Python code) via `python-dotenv`. Refs #2442 ## Test plan - [x] `test_conversation_from_postgres` passes - [x] `test_pakistan_conversation_batch` passes (9 min, 400K votes) - [x] Full delphi test suite: 294 passed, no regressions 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Add connect_timeout to all DB connections, fail tests on DB unavailable - Fail tests fast when services are unavailable instead of hanging - Fix conftest docstring: require_dynamodb/require_s3, not require_service - Skip (not fail) test_minio_access when MinIO is unavailable - Fail test_conversation_from_postgres fast when DynamoDB is unavailable - Fail test_run_math_pipeline_e2e fast when DynamoDB is unavailable - Add pytest-timestamper and fix test_501 hang on Docker unavailable - Defer heavy import in test_umap_narrative_pipeline (defensive) - Address Copilot review: fix timing_stats_compared None bug, fix return type - Fix test DB connection: use DATABASE_URL with dotenv - Fix CI: use find_dotenv() and fall back to DATABASE_* vars - Restore get_or_compute_conversation fixture and test reordering hook - Remove dead xdist_group markers from make_dataset_params commit-id:b9062b50 * Fix D9: z-score thresholds from two-tailed to one-tailed ## Summary - Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj` - `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449 - Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values ## Why one-tailed? The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5** — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance. ## Test plan - [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green - [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private) - [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded) - [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds) - [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Plan: add task parallelization analysis for remaining fixes - Fix D9: match Clojure z-sig semantics (strict >, no abs) and remove dead stats.py - Re-record vw golden snapshot after D9 z-sig semantics change - Update plan: mark D9 as done, note stats.py removal for next PR - Add mathematical rigor and exhaustive testing guidance to fix plan - Plan: move PR 14 earlier (prerequisite for blob tests) + add handoff doc - Re-record golden snapshots after upstream cascade commit-id:0194003d * Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) ## Summary Replace Python's standard one-proportion z-test `prop_test(p, n, p0)` with Clojure's Wilson-score-like formula `prop_test(succ, n)` from `stats.clj:10-15`: ``` 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) ``` The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1) prior) that regularizes extreme values for small Polis groups. This is separate from the `PSEUDO_COUNT=2.0` used for `pa`/`pd` estimation (Beta(2,2) prior): - `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation - `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing **What changed in the output**: `pat`, `pdt` values (proportion test z-scores), and downstream `agree_metric` / `disagree_metric` values. The z-scores are now slightly different due to `sqrt(n+1)` vs `sqrt(n)` and `(succ+1)/(n+1)` vs `(na+1)/(n+2)` denominators. ## Changes - `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula - `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)` - `repness.py`: Callers updated to pass raw counts `(na, ns)` instead of `(pa, ns, 0.5)` - `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added 8 test cases + edge case - `test_repness_unit.py`, `test_old_format_repness.py`: Updated for new signature - Golden snapshots re-recorded for all datasets ## Test plan - [x] D5 formula tests pass (8 input pairs + edge cases) - [x] D5 Clojure blob consistency check passes (all datasets) - [x] Full test suite passes (public + private, 19/19 regression tests) - [x] Only pre-existing failure: pakistan-incremental D2 (unrelated) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - RED: add D5 blob injection test (prop_test vs Clojure p-test values) - Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) - Update plan and journal: mark D5 as done - Plan: add D5 PR number and stack position to cross-reference commit-id:48b77ba3 * Fix D6: match Clojure two-proportion test formula (+1 pseudocount) ## Summary The Python `two_prop_test` used a standard two-proportion z-test with no pseudocounts, while Clojure's `stats/two-prop-test` (stats.clj:18-33) adds +1 to all four inputs (`succ-in`, `succ-out`, `pop-in`, `pop-out`) via `(map inc ...)` before computing the pooled z-test. This Laplace smoothing regularizes z-scores for small group sizes, which are common in Polis conversations. ## Changes - **Signature change**: `two_prop_test(p1, n1, p2, n2)` (proportions) → `two_prop_test(succ_in, succ_out, pop_in, pop_out)` (raw counts) - **Formula**: Standard pooled z-test on pseudocount-adjusted values: `pi1 = (succ_in+1)/(pop_in+1)`, `pi_hat = (s1+s2)/(p1+p2)` - **Callers updated**: Both scalar (`add_comparative_stats`) and vectorized (`compute_group_comment_stats_df`) now pass raw counts matching Clojure's `(stats/two-prop-test (:na in-stats) (sum :na rest-stats) (:ns in-stats) (sum :ns rest-stats))` (repness.clj:97-100) ## Affected output fields - `rat` (agree representativeness test z-score) - `rdt` (disagree representativeness test z-score) - `agree_metric`, `disagree_metric` (downstream of rat/rdt) ## Test plan - [x] Targeted D6 tests pass (formula, edge cases, regularization effect) - [x] Full test suite passes (excluding DynamoDB/MinIO tests) - [x] Private dataset tests pass (--include-local) - [x] Golden snapshots re-recorded for all 7 datasets 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - RED: add D6 blob injection test (two_prop_test vs Clojure repness-test) - Fix D6: match Clojure two-proportion test formula (+1 pseudocount) - Plan: add D6 PR number and stack position to cross-reference commit-id:23c03d70 * Fix D7: match Clojure repness metric formula (product of 4 signed values) ## Summary Changes the representativeness metric from a weighted sum of absolutes to the Clojure product formula (repness.clj:188-190). **Before (Python):** - agree_metric = `pa * (|pat| + |rat|)` — weighted sum, tolerant of weak factors - disagree_metric = `(1 - pd) * (|pdt| + |rdt|)` — doubly wrong: uses `(1-pd)` and sum **After (Clojure formula):** - agree_metric = `ra * rat * pa * pat` — product of 4 signed values - disagree_metric = `rd * rdt * pd * pdt` — product of 4 signed values The product formula is more conservative: any factor near zero kills the entire metric, requiring ALL dimensions (probability, significance, relative representativeness) to be strong simultaneously. The old disagree formula was doubly wrong: 1. Used `(1 - pd)` instead of `pd` — high metric when disagree probability is LOW 2. Used a weighted sum of absolutes instead of a signed product No feature flag for the old formula — it has no defensible behavior. ## Test plan - [x] 5 new D7 formula tests (agree product, disagree product, zero-kills-metric, sign preservation, multiple known values) - [x] Updated unit tests in test_repness_unit.py and test_old_format_repness.py - [x] Full test suite passes (excluding DynamoDB/MinIO tests) - [x] Private dataset tests pass (--include-local) - [x] Golden snapshots re-recorded for all 7 datasets 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Add PR description and update plan/journal for D7 fix commit-id:80eaa87c * Fix D8: match Clojure repful classification (rat > rdt) ## Summary Simplifies the repful ("representative for agree or disagree?") classification to match Clojure's `finalize-cmt-stats` (repness.clj:175-177). **Before (Python):** 3-branch conditional: 1. `pa > 0.5 AND ra > 1.0` → agree 2. `pd > 0.5 AND rd > 1.0` → disagree 3. Fallback: whichever metric is higher **After (Clojure):** `rat > rdt` → agree, else disagree. The old thresholds were redundant — `rat` and `rdt` (two-proportion z-scores) already encode whether the group's agree/disagree rate is significantly higher than other groups. The simple comparison is both correct and clearer. ## Changes - `repness.py`: `finalize_cmt_stats()` — 3-branch logic → `rat > rdt` - `repness.py`: Vectorized — `np.select` with conditions → `np.where(rat > rdt)` - `test_discrepancy_fixes.py`: Expanded from 2 to 6 tests (including edge cases: equal rat/rdt, both negative, both zero) - Golden snapshots re-recorded (repful direction changes for some comments) ## Test plan - [x] 6 targeted D8 tests pass (rat>rdt, rat<rdt, equal, both negative, both zero, old-vs-new divergence case) - [x] Full test suite passes (excluding DynamoDB/MinIO tests) - [x] Private dataset tests pass (--include-local) - [x] Golden snapshots re-recorded for all 7 datasets - [x] 19/19 regression tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Journal: add review session entry (2026-03-17) commit-id:9ec28252 * Fix D15: match Clojure moderation handling (zero out columns, don't remove) ## Summary Python's `_apply_moderation()` removed moderated-out comment columns entirely from `rating_mat`. Clojure's `zero-out-columns` (named_matrix.clj:214-230) sets all values in moderated columns to 0, preserving matrix structure. This fix changes Python to match: - Moderated-out comment columns are **zeroed** (values set to 0.0), not removed - `rating_mat` retains the same column count as `raw_rating_mat` - Moderated-out participants (rows) are still removed — unchanged ### Why zeroing matters - **Matrix dimensions**: Clojure's `rating-mat` has the same shape as `raw-rating-mat`. Downstream code (PCA, repness) processes the same-shaped matrix. - **tids list**: Column indices stay stable. Consumers depend on this. - **Practical impact**: Zeroed columns have no signal (na=0, nd=0), so they fail significance tests and are excluded from repness/consensus. PCA sees zero variance. ## Changes - `conversation.py`: `_apply_moderation()` — zero out columns instead of removing - `test_discrepancy_fixes.py`: 5 new synthetic tests + 2 enhanced real-data tests - `test_conversation.py`: Updated to expect zeroed columns ## Test plan - [x] Synthetic tests: zeroing preserves columns, values are 0, non-moderated unchanged - [x] Real-data test: biodiversity-incremental (169 mod-out comments) - [x] Full public test suite: 328 passed, 0 failed - [x] TDD cycle: RED (2 failures) → GREEN (all pass) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Fix D15: match Clojure moderation handling (zero out columns, don't remove) commit-id:c3450b9a * Fix K-means k divergence: preserve vote-encounter row order ## Summary - Fix K-means k divergence between Python and Clojure by preserving vote-encounter order for participant rows in the rating matrix - Python was using `natsorted()` (PID-numeric order) while Clojure's NamedMatrix preserves insertion order — different row ordering cascades into different first-k-distinct initialization seeds for group-level k-means - On vw: Python picked k=4 (wrong), Clojure picks k=2 — now both pick k=2 with identical cluster memberships ## Investigation findings The divergence chain: rating_mat row order → PCA projection order → base-cluster ID assignment → group k-means first-k-distinct init → different local optima → different silhouette landscape → different k. PCA components are identical (cosine similarity = 1.0), silhouette implementation matches, k-means algorithm matches — only the data ORDER feeding first-k-distinct differed. ## Changes - `conversation.py`: `update_votes()` preserves vote-encounter order for participant rows instead of `natsorted()` - `conversation.py`: `_apply_moderation()` preserves row order with list comprehension - Column (comment ID) ordering remains `natsorted` — doesn't affect clustering - Re-recorded vw cold-start blob and golden snapshots - Updated ordering tests, removed `test_group_clustering` xfail - Added `scripts/investigate_k_divergence.py` diagnostic tool ## Cold-start blob results | Dataset | Clj k | Py k | Match | |---------|-------|------|-------| | vw | 2 | 2 | exact (sizes [50,17]) | | biodiversity | 2 | 2 | exact (sizes [81,19]) | | bg2018 | 2 | 2 | close ([51,49] vs [52,48]) | | FLI | 2 | 3 | inherent PCA divergence (94.5% NaN, sil gap 0.001) | ## Test plan - [x] All 297 tests pass (0 failures, 58 xfailed) - [x] vw cold-start: k=2 exact match with Clojure blob - [x] biodiversity cold-start: k=2 exact match - [x] Ordering tests updated to expect encounter order - [ ] Re-record private dataset golden snapshots after stack rebase 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Fix K-means k divergence: preserve vote-encounter order for participant rows - Update plan and journal: K-divergence investigation resolved - Remove investigation script (one-off diagnostic, not production code) - Rename k-divergence doc: investigation record, not a handoff - Update references to renamed investigation doc commit-id:4598a0a1 * Docs: plan + journal updates for the 2026-06-09/10 stack session Plan and journal updates accumulated over the 2026-06-09 / 2026-06-10 sessions where the stack (#2516–#2524) was audited, recovered, and re-reviewed by Copilot. No code changes — just docs that summarize what landed in the math-fix PRs below this one and what's pending for the next stack. Major sections added/updated: CLJ-PARITY-FIXES-JOURNAL.md: - "Session: Audit + Recovery (2026-06-09)" — log of the 9-PR multi-agent audit (20 confirmed / 26 refuted findings), the structural discovery that PR #2521 D7 and PR #2522 D8 had landed as docs-only commits with zero code changes, the recovery work, and the in-session-discovered follow-ups (D5 n=0 short-circuit, D15 downstream regressions, to_dynamo_dict parallel inline code). - "Copilot review follow-up (2026-06-10)" — outcomes of two rounds of Copilot review across the stack (13 substantive comments + 7 stale historical ones), including the moderated-out-participants (mod_out_ptpts) leak Copilot caught in PR #2523 and the serialization-shape regression test added to pin to_dict / to_dynamo_dict outputs against silent drift. - "Operational impact of the mod_out_ptpts fix (prodclone verification, 2026-06-10)" — aggregate evidence that 67 conversations in prodclone have at least one banned participant, some still actively voting in 2025; the Python-side fix isn't strict Clojure parity (Clojure has the same latent leak) but is the correct Polis behavior. PLAN_DISCREPANCY_FIXES.md: - All D-row entries updated to current status (D5, D6, D7, D8, D9, D15 now DONE; D7 and D8 specifically recovered from the docs-only state the previous PRs left them in). - Stale PR-number cross-references corrected (#2446 -> #2518 for D9, #2453 -> #2524 for K-divergence). - "Audit-Discovered Issues (2026-06-09)" section — final state of all four cleanup PRs (#2516, #2517, #2523 follow-up, #2524 cleanups). - "Pending — needs team discussion (2026-06-10)" section — the participant-mod / banned-participant semantics gap (waiting on Colin), golden-snapshot re-recording, scalar/vectorized pi_hat > 1 edge case, to_dynamo_dict (now resolved). Stack-tip docs PR to ride on top of the math fixes when the stack merges. commit-id:0ccf8c0a
Summary
stats.cljZ_90: 1.645 → 1.2816,Z_95: 1.96 → 1.6449stats.pyalready used the correct one-tailed valuesWhy one-tailed?
The proportion tests in Polis check whether a comment's agree (or disagree) rate is significantly above 0.5 — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.
Test plan
🤖 Generated with Claude Code